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Abstract 



Questions of understanding and quantifying the representation and amount of informa- 
tion in organisms have become a central part of biological research, as they potentially hold 
the key to fundamental advances. In this paper, we demonstrate the use of information- 
theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that 
are statistically correlated. We develop a precise and reliable methodology, based on the 
notion of mutual information, for finding and extracting statistical as well as structural 
dependencies. A simple threshold function is defined, and its use in quantifying the level of 
significance of dependencies between biological segments is explored. These tools are used 
in two specific applications. First, for the identification of correlations between different 
parts of the maize zmSRp32 gene. There, we find significant dependencies between the 
5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation 
may indicate the presence of as-yet unknown alternative splicing mechanisms or structural 
scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we 
demonstrate that our approach is particularly well suited for the problem of discovering 
short tandem repeats, an application of importance in genetic profiling. 

Key Words: sequence analysis, statistical correlation, statistical significance, mutual informa- 
tion, exon, intron, alternative splicing, untranslated regions, error correction, tandem repeats. 
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1 Introduction 



Questions of quantification, representation, and description of the overall flow of information 
in biosystems are of central importance in the life sciences. In this paper, we develop statistical 
tools based on information-theoretic ideas, and demonstrate their use in identifying informa- 
tive parts in biomolecules. Specifically, our goal is to detect statistically dependent segments 
of biosequences, hoping to reveal potentially important biological phenomena. It is well-known 
[SI [TUl [2T] that various parts of biomolecules, such as DNA, RNA, and proteins, are signifi- 
cantly (statistically) correlated, although formal measures and techniques for quantifying these 
correlations are topics of current investigation. The biological implications of these correla- 
tions are deep, and they themselves remain unresolved. For example, statistical dependencies 
between exons carrying protein coding sequences and noncoding introns may indicate the exis- 
tence of as-yet unknown error correction mechanisms or structural scaffolds. Thus motivated, 
we propose to develop precise and reliable methodologies for quantifying and identifying such 
dependencies, based on the information-theoretic notion of mutual information. 

Biomolecules store information in the form of monomer strings such as deoxyribonu- 
cleotides, ribonucleotides, and amino acids. As a result of numerous genome and protein 
sequencing efforts, vast amounts of sequence data is now available for computational analysis. 
While basic tools such as BLAST provide powerful computational engines for identification 
of conserved sequence motifs, they are less suitable for detecting potential hidden correlations 
without experimental precedence (higher-order substitutions). 

The application of analytic methods for finding regions of statistical dependence through 
mutual information has been illustrated through a comparative analysis of the 5' untranslated 
regions of DNA coding sequences [15]. It has been known that eukaryotic translational ini- 
tiation requires the consensus sequence around the start codon defined as the Kozak's motif 
[TT] . By screening at least 500 sequences, an unexpected correlation between positions -2 and 
-1 of the Kozak's sequence was observed, thus implying a novel translational initiation signal 
for eukaryotic genes. This pattern was discovered using mutual information, and not detected 
by analyzing single-nucleotide conservation. In other relevant work, neighbor-dependent sub- 
stitution matrices were applied to estimate the average mutual information content of the core 
promoter regions from five different organisms [161 [T7j . Such comparative analyses verified the 
importance of TATA-boxes and transcriptional initiation. A similar methodology elucidated 
patterns of sequence conservation at the 3' untranslated regions of orthologous genes from hu- 
man, mouse, and rat genomes |20j . making them potential targets for experimental verification 
of hidden functional signals. 

In a different kind of application, statistical dependence techniques find important appli- 
cations in the analysis of gene expression data. Typically, the basic underlying assumption 
in such analyses is that genes expressed similarly under divergent conditions share functional 
domains of biological activity. Establishing dependency or potential relationships between sets 
of genes from their expression profiles holds the key to the identification of novel functional 
elements. Statistical approaches to estimation of mutual information from gene expression 
datasets have been investigated in [21j. 

Protein engineering is another important area where statistical dependency tools are uti- 
lized. Reliable predictions of protein secondary structures based on long-range dependencies 
may enhance functional characterizations of proteins [2] . Since secondary structures are deter- 
mined by both short- and long-range interactions between single amino acids, the application 
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of comparative statistical tools based on consensus sequence algorithms or short amino acid 
sequences centered on the prediction sites, is far from optimal. Analyses that incorporate 
mutual information estimates may provide more accurate predictions. 

In this work we focus on developing reliable and precise information-theoretic methods for 
determining whether two biosequences are likely to be statistically dependent. Another moti- 
vating factor for this project, which is more closely related to ideas from information theory, 
is the question of determining whether there are error correction mechanisms built into large 
molecules, as argued by Battail; see [3] and the references therein. We choose to work with 
protein coding exons and non-coding introns. While exons are well conserved parts of DNA, 
introns have much greater variability. They are dispersed on strings of biopolymers and still 
they have to be precisely identified in order to produce biologically relevant information. It 
seems that there is no external source of information but the structure of RNA molecules them- 
selves to generate functional templates for protein synthesis. Determining potential mutual 
relationships between exons and introns may justify additional search for still unknown factors 
affecting RNA processing. 

The complexity and importance of the RNA processing system is emphasized by the largely 
unexplained mechanisms of alternative splicing, which provide a source of substantial diversity 
in gene products. The same sequence may be recognized as an exon or an intron, depending 
on a broader context of splicing reactions. The information that is required for the selection of 
a particular segment of RNA molecules is very likely embedded into either exons or introns, or 
both. Again, it seems that the splicing outcome is determined by structural information carried 
by RNA molecules themselves, unless the fundamental dogma of biology (the unidirectional 
flow of information from DNA to proteins) is to be questioned. 

Finally, the constant evolution of genomes introduces certain polymorphisms, such as tan- 
dem repeats, which are an important component of genetic profiling applications. We also 
study these forms of statistical dependencies in biological sequences using mutual information. 

In Section[2]we develop some theoretical background, and we derive a threshold function for 
testing statistical significance. This function admits a dual interpretation either as the classical 
log-likelihood ratio from hypothesis testing, or as the "empirical mutual information." 

Section [3] contains our experimental results: In Section [3. II we present our empirical findings 
on the problem of detecting statistical dependency between different parts in a DNA sequence. 
Extensive numerical experiments were carried out on certain regions of the maize zmSRp32 
gene [7], which is functionally homologous to the human ASF/SF2 alternative splicing factor. 
The efficiency of the empirical mutual information in this context is demonstrated. Moreover, 
our findings suggest the existence of a biological connection between the 5' untranslated region 
in zmSRp32 and its alternatively spliced exons. 

Finally, in Section [3.21 we show how the empirical mutual information can be utilized in the 
difficult problem of searching DNA sequences for short tandem repeats (STRs), an important 
task in genetic profiling. We extend the simple hypothesis test of the previous sections to a 
methodology for testing a DNA string against different "probe" sequences, in order to detect 
STRs both accurately and efficiently. Experimental results on DNA sequences from the FBI's 
Combined DNA Index System (CODIS) are presented, showing that the empirical mutual 
information can be a powerful tool in this context as well. 
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2 Theoretical Background 



In this section we outline the theoretical basis for the mutual information estimators we will 
later apply to biological sequences. 

Suppose we have two strings of unequal lengths, 

Xi = Xi , X2 , . . . , Xn (1) 

Yf = yi,i2,i3, (2) 

where M > n, taking values in a common finite alphabet A. In most of our experiments, 
M is significantly larger than n; typical values of interest are n « 80 and M « 300. Our 
main goal is to determine whether or not there is some form of statistical dependence between 
them. Specifically, we assume that the string Xf consists of independent and identically 
distributed (i.i.d.) random variables Xi with common distribution P(x) on A, and that the 
random variables Yi are also i.i.d. with a possibly different distribution Q{y). Let {Vl^(y|x)} 
be a family of conditional distributions, or "channel," with the property that, when the input 
distribution is P, the output has distribution Q, that is, X^xeA -^(^)^(yl^) ~ Qiv)^ for all y. 
We wish to differentiate between the following two scenarios: 

(I) Independence. X^ and Y^ are independent. 

(II) Dependence. First X" is generated. Then an index J £ {1, 2, . . . , M — n + 1} is chosen 
in an arbitrary way, and Yj~^"'~^ is generated as the output of the discrete memoryless channel 
W with input X"; that is, for each j = 1, 2, . . . , n, the conditional distribution of Y^_|_j_i given 
Xf is W{y\Xj). Finally the rest of the l^'s are generated i.i.d. according to Q. [To avoid the 
trivial case where both scenarios are identical, we assume that the rows of W are not all equal 
to Q so that in the second scenario Xf and yy+"~i are actually not independent.] 

It is important at this point to note that, although neither of these two cases is biologically 
realistic as a description of the elements in a genomic sequence, it turns out that this set of 
assumptions provides a good operational starting point: The experimental results reported in 
Section [3] clearly indicate that, in practice, the resulting statistical methods obtained under 
the present assumptions can provide accurate and biologically relevant information. 

To distinguish between (I) and (II), we look at every possible alignment of Xf with Y-^ , 
and we estimate the mutual information between them. Recall that for two random variables 
X, y with marginal distributions P{x),Q{y), respectively, and joint distribution V{x,y), the 
mutual information between X and Y is defined as, 

HX;Y)=^V(.,y)<.,^^^. (3) 

Recall also |5j that /(X; Y) is always nonnegative, and it equals zero if and only if X and Y are 
independent. The logarithms above and throughout the paper are taken to base 2, log = log2, 
so that /(X; Y) can be interpreted as the number of bits of information that each of these two 
random variables carries about the other; cf. [5]. 

In order to distinguish between the two scenarios above, we compute the empirical mutual 
information between X" and each contiguous substring of Y^^^ of length n: For each j = 
1, 2, . . . , M — n + 1, let pj{x,y) denote the joint empirical distribution of (Xf ,1^^'*'" ^), i.e., 
let Pj{x, y) be the proportion of the n positions in (Xi, Y^), (X2, Y^+i), . . . , (X„, where 
(Xj, Y^_i_j_i) equals (x,y). Similarly, let p{x) and qj{y) denote the empirical distributions of 
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X" and Yj^"^ ""^i respectively. We define the empirical (per-symbol) mutual information ij{n) 
between X" and y^/^""^^ by applying ([3|) to the empirical instead of the true distributions, so 
that, 

The law of large numbers implies that, as n ^ oo, we have p{x) P{x), qj{y) Q{x), and 
Pj{x,y) converges to the true joint distribution of X,Y. 

Clearly, this implies that in scenario (I), where and are independent, Ij{n) 0, 
for any fixed j, as n — > cxd. On the other hand in scenario (II), Ij{n) converges to I{X; Y) > 
where the two random variables X, Y are such that X has distribution P and the conditional 
distribution of Y given X = x is W{y\x). 

2.1 An Independence Test Based on Mutual Information 

We propose to use the following simple test for detecting dependence between and Y"/^. 
Choose and fix a threshold 9 > 0, and compute the empirical mutual information Ij{n) between 
and each contiguous substring Yj~^"'~^ of length n from Y^ . If Ij{n) is larger than 9 for 

some j, declare that the strings Xf and Y^^^~^ are dependent; otherwise, declare that they 
are independent. 

Before examining the issue of selecting the value of the threshold 9, we note that this 
statistic is identical to the (normalized) log-likelihood ratio between the above two hypotheses. 
To see this, observe that, expanding the definition pj{x,y) in Ij{n), we can simply rewrite, 

n 



where the indicator function I{(x,,Y,+j_i)}(^) y) equals 1 if (Xj,y^+j_i) = {x,y), and it is equal 
to zero otherwise. Then, 

= -Z^^°g ^( X\n (Y T 

P[^i)Qj{J^j+i~l) 



-log 

n 



m=lPJix^^yJ+^-l) 
nr=iP(x,)g,(y, 



which is exactly the normalized logarithm of the ratio between the joint empirical likelihood 
nr=i Pji-^i^^j+i-i) of the two strings, and the product of their empirical marginal likelihoods 

[u:=lP{x^)m=lm+^-l)]■ 



2.2 Probabilities of Error 

There are two kinds of errors this test can make: Declaring that two strings are dependent 
when they are not, and vice versa. The actual probabilities of these two types of errors depend 
on the distribution of the statistic Ij{n). Since this distribution is independent of j, we take 
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j = 1 and write I{n) for the normalized log-likelihood ratio Ii{n). The next two subsections 
present some classical asymptotics for /i(n). 



Scenario (I): Independence. We already noted that in this case /(n) converges to zero 
as n — > oo, and below we shall see that this convergence takes place at a rate of approximately 
1/n. Specifically, I{n) with probability one, and a standard application of the multivariate 
central limit theorem for the joint empirical distribution pj shows that nl{n) converges in 
distribution to a (scaled) random variable. This a classical result in statistics [12^ 118]. and, 
in the present context, it was rederived by Hagenauer et al. [21[H]. We have, 

(21n2)n/(n)^Z~x'((l^|-l)'), 

where Z has a distribution with k = (\A\ — 1)^ degrees of freedom, and where \A\ denotes 
the size of the data alphabet. 

Therefore, for a fixed threshold > and large n, we can estimate the probability of error 

as, 

Pe,i = Prjdeclare dependence | independent strings} 
= Pr{/(n) > 9 I independent strings} 

« Pr{Z > (21n2)0n}, (6) 

where Z is as before. Therefore, for large n the error probability Pe^i decays like the tail of 
the distribution function, 

7(/c, (6'ln2)n) 



e,l 



1 



where k = - — ^ ' , and F, 7 denote the Gamma function and the incomplete Gamma function, 
respectively. Although this is fairly implicit, we know that the tail of the x^ distribution decays 
like e~^/^ as 2; ^ 00, therefore, 

Pe,i~exp{-(01n2)n}, (7) 
where this approximation is to first order in the exponent. 



Scenario (II): Dependence. In this case the asymptotic behavior of the test statistic 
/(n) is somewhat different. Suppose as before that the random variables are i.i.d. with 
distribution P, and that the conditional distribution of each Yi given is W{y\Xi)^ for some 
fixed family of conditional distributions VF(y|x); this makes the random variables i.i.d. 
with distribution Q. 

We mentioned in the last section that, under the second scenario, I{n) converges to the 
true underlying value / = I{X; Y) of the mutual information, but, as we show below, the rate 
of this convergence is slower than the 1/n rate of scenario (I): Here, I{n) — > / with probability 
one, but only at rate Xj ^fn^ in that, ^Jn\l{n) — /] converges in distribution to a Gaussian, 

V^[/(n) - /] ^ y ~ iV(0, (T^), (8) 
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where the resulting variance cj^ is given by, 

- -( ^) ^E/™ ( 1!? - ') '^^ 

[An outhne of the proof of ^ is given below.] 

Therefore, for any fixed threshold 6 < I and large n, the probability of error satisfies, 

Pe,2 = Prjdeclare independence | VF-dependent strings} 

= Pr{/(n) < 9 I W-dependent strings} 

« Fv{V < [9 - I]^/^} 

~ exp{-^^^-P^n}, (9) 

where the last approximation sign indicates equality to first order in the exponent. Thus, 
despite the fact that I{n) converges at different speeds in the two scenarios, both error prob- 
abilities Pe,i and Pe,2 decay exponentially with the sample size n. 

To see why ([8]) holds it is convenient to use the alternative expression for I{n) given in ([5]). 
Using this, and recalling that I(n) = Ii{n), we obtain. 



vi[/(n)-/i=vH[ii:>»«|ftfy-^; 



1=1 



Since the empirical distributions converge to the corresponding true distributions, for large n 
it is straightforward to justify the approximation. 

The fact that this indeed converges in distribution to a iV(0, cr^), as n ^ oo, easily follows 
from the central limit theorem, upon noting that the mean of the logarithm in (jlOp equals / 
and its variance is a^. 



Discussion. From the above analysis it follows that, in order for both probabilities of error 
to decay to zero for large n (so that we rule out false positives as well as making sure that 
no dependent segments are overlooked) the threshold 9 needs to be strictly between and 
/ = I{X;Y). For that, we need to have some prior information about the value of /, i.e., of 
the level of dependence we are looking for. If the value of / were actually known and a fixed 
threshold 9 £ (0, 1) was chosen independent of n, then both probabilities of error would decay 
exponentially fast, but with typically very different exponents, 

Pe,i ~ exp{— (0 In 2) n} and Pe,2 ~ exp | — ("^/^) 

recall the expressions in ([7]) and ([9]). Clearly, balancing the two exponents also requires knowl- 
edge of the value of o"^ in the case when the two strings are dependent, which, in turn, requires 
full knowledge of the marginal distribution P and the channel W. Of course this is unreason- 
able, since we cannot specify in advance the exact kind and level of dependence we are actually 
trying to detect in the data. 
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A practical (and standard) approach is as follows: Since the probability of error of the 
first kind Pi^e only depends on 9 (at least for large n), and since in practice declaring false 
positives is much more undesirable than overlooking potential dependence, in our experiments 
we decide on an acceptably small false-positive probability e, and then select 9 based on the 
above approximation, by setting Pe,i « e in (0). 

3 Experimental Results 

In this section we apply the mutual information test described above to biological data. First 
we show that it can be used effectively to identify statistical dependence between regions of the 
maize zmSRp32 gene that may be involved in alternative processing (splicing) of pre-mRNA 
transcripts. Then we show how the same methodology can be easily adapted to the problem 
of identifying tandem repeats. We present experimental results on DNA sequences from the 
FBI's Combined DNA Index System (CODIS), which clearly indicate that the empirical mutual 
information can be a powerful tool for this computationally intensive task. 

3.1 Detecting DNA Sequence Dependencies 

All of our experiments were performed on the maize zmSRp32 gene [7]. This gene belongs to 
a group of genes that are functionally homologous to the human ASF /SF2 alternative splicing 
factor. Interestingly, these genes encode alternative splicing factors in maize and yet themselves 
are also alternatively spliced. The gene zmSRp32 is coded by 4735 nucleotides and has four 
alternative splicing variants. Two of these four variants are due to different splicings of this 
gene, between positions 1-369 and 3243-4220, respectively, as shown in Figure [TJ The results 
given here are primarily from experiments on these segments of zmSRp32. 

In order to understand and quantify the amount of correlation between different parts of this 
gene, we computed the mutual information between all functional elements including exons, 
introns, and the 5' untranslated region. As before, we denote the shorter sequence of length n 
hyXf = {Xi,X2,...,Xn) and the longer one of length M by Y^^^ = {Yi,Y2, . . . ,Ym). We apply 
the simple mutual information estimator Ij (n) defined in ^ to estimate the mutual information 
between X" and Y^^^^^ for each j = 1,2,... ,M — n + 1, and we plot the "dependency graph" 

oi Ij = Ij{n) versus j; see Figure[2l The threshold 9 is computed according to ([6]), by setting e, 
the probability of false positives, equal to 0.001; it is represented by a (red) straight horizontal 
line in the figures. 

In order to "amplify" the effects of regions of potential dependency in various segments 
of the zmSRp32 gene, we computed the mutual information estimates Ij on the original 
strings over the regular four-letter alphabet {A,C,G,T}, as well as on transformed versions 
of the strings where pairs of letters were grouped together, using either the Watson-Crick 
pair {AT, CG} or the purine-pyrimidine pair {AG, CT}. In our results we observed that such 
groupings are often helpful in identifying dependency; this is clearly illustrated by the esti- 
mates shown in Figures [2] and [3l Sometimes the {AT,CG} pair produces better results, while 
in other cases the purine-pyrimidine pair finds new dependencies. 

Figure [2] strongly suggests that there is significant dependence between the bases in posi- 
tions 1-369 and certain substrings of the bases in positions 3243-4220. While the 1-369 region 
contains the 5' untranslated sequences, an intron, and the first protein coding exon, the 3243- 
4220 sequence encodes an intron that undergoes alternative splicing. After narrowing down 
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Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number 
of exons (shaded boxes) and introns (hues) flanked by the 5' and 3' untranslated regions 
(white boxes). RNA transcripts (pre-mRNA) are processed to yield mRNA molecules used 
as templates for protein synthesis. Alternative pre- mRNA splicing generates different mRNA 
templates from the same transcripts, by selecting either alternative exons or alternative introns. 
The regions discussed in the text are identified by indices corresponding to the nucleotide 
position in the original DNA sequence. 
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Figure 2: Estimated mutual information between the exon located between bases 1-369 and 
each contiguous subsequence of length 369 in the intron between bases 3243-4220. The es- 
timates were computed both for the original sequences in the standard four-letter alphabet 
{A,C,G,T} (shown in (a)), as well as for the corresponding transformed sequences for the 
two-letter purine/pyrimidine grouping {AG,CT} (shown in (b)). 



9 



the mutual information calculations to the 5' untranslated region (5'UTR) in positions 1-78 
and the 5'UTR intron in positions 78-268, we found that the initially identified dependency 
was still present; see Figure [3l A close inspection of the resulting mutual information graphs 
indicates that the dependency is restricted to the alternative exons embedded into the intron 
sequences, in positions 3688-3800 and 3884-4254. 

These findings suggest that there might be a deeper connection between the 5'UTR DNA 
sequences and the DNA sequences that undergo alternative splicing. The UTRs are multi- 
functional genetic elements that control gene expression by determining mRNA stability and 
efficiency of mRNA translation. Like in the zmSRp32 maize gene, they can provide multiple 
alternatively spliced variants for more complex regulation of mRNA translation [10] . They also 
contain a number of regulatory motifs that may affect many aspects of mRNA metabolism. 
Our observations can therefore be interpreted as suggesting that the maize zmSRp32 5'UTR 
contains information that could be utilized in the process of alternative splicing, yet another 
important aspect of mRNA metabolism. The fact that the value of the empirical mutual in- 
formation between 5'UTR and the DNA sequences that encode alternatively spliced elements 
is significantly greater than zero clearly points in that direction. Further experimental work 
could be carried out to verify the existence, and further explore the meaning, of these newly 
identified statistical dependencies. 

We should note that there are many other sequence matching techniques, the most popular 
of which is probably the celebrated BLAST algorithm. BLAST'S working principles are very 
different from those underlying our method. As a first step, BLAST searches a database of 
biological sequences for various small words found in the query string. It identifies sequences 
that are candidates for potential matches, and thus eliminates a huge portion of the database 
containing sequences unrelated to the query. In the second step, small word matches in every 
candidate sequence are extended by means of a Smith- Waterman-type local alignment algo- 
rithm. Finally, these extended local alignments are combined with some scoring schemes, and 
the highest scoring alignments obtained are returned. Therefore, BLAST requires a consider- 
able fraction of exact matches to find sequences related to each other. However, our approach 
does not enforce any such requirements. For example, if two sequences do not have any exact 
matches at all, but the characters in one sequence are a character-wise encoding of the ones 
in the other sequence, then BLAST would fail to produce any significant matches (without 
corresponding substitution matrices) , while our algorithm would detect a high degree of depen- 
dency. This is illustrated by the results in the following section, where the presence of certain 
repetitive patterns in Y-^^ is revealed through matching it to a "probe sequence" which 
does not contain the repetitive pattern, but is "statistically similar" to the pattern sought. 

3.2 Application to Tandem Repeats 

Here we further explore the utility of the mutual information statistic, and we examine its 
performance on the problem of detecting Short Tandem Repeats (STRs) in genomic sequences. 
STRs, usually found in non-coding regions, are made of back-to-back repetitions of a sequence 
which is at least two bases long and generally shorter than 15 bases. The period of an STR 
is defined as the length of the repetition sequence in it. Owing to their short lengths, STRs 
survive mutations well, and can easily be amplified using PGR without producing erroneous 
data. Although there are many well-identified STRs in the human genome, interestingly, the 
number of repetitions at any specific locus varies significantly among individuals; that is, they 
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Figure 3: Dependency graph of Ij versus j for the zmSRp32 gene, using different alphabet 
groupings: In (a) and (b), we plot the estimated mutual information between the exon found 
between bases 1-78 and each subsequence of length 78 in the intron located between bases 
3243-4220. Plot (a) shows estimates over the original four-letter alphabet {A, C, G, T}, and (b) 
shows the corresponding estimates over the Watson-Crick pairs {AT,CG}. Similarly, plots (c) 
and (d) contain the estimated mutual information between the intron located in bases 79- 
268 and all corresponding subsequences of the intron between bases 3243-4220. Plot (c) 
shows estimates over the original alphabet, and plot (d) over the two-letter purine/pyrimidine 
grouping {AG,CT}. Plots (e) and (f) show the estimated mutual information between the 
5' untranslated region and all corresponding subsequences of the intron between bases 3243- 
4220, for the four- letter alphabet (in (e)), and for the two- letter purine/pyrimidine grouping 
{AG,GT} (in(f)). 11 



are polymorphic DNA fragments. These properties make STRs suitable tools for determining 
genetic profiles, and have become a prevalent method in forensic investigations. Long repetitive 
sequences have also been observed in genomic sequences, but have not gained as much attention 
since they cannot survive environmental degradation and do not produce high quality data from 
PGR analysis. 

Several algorithms have been proposed for detecting STRs in long DNA strings with no 
prior knowledge about the size and the pattern of repetition. These algorithms are mostly 
based on pattern matching, and they all have high time-complexity. Finding short repetitions 
in a long sequence is a challenging problem. When the query string is a DNA segment that 
contains many insertions, deletions or substitutions due to mutations, the problem becomes 
even harder. Exact- and approximate-pattern matching algorithms need to be modified to 
account for these mutations, and this renders them complex and inefficient. To overcome these 
limitations, we propose a statistical approach using an adaptation of the method described in 
the previous sections. 

In the United States, the FBI has decided on 13 loci to be used as the basis for genetic 
profile analysis, and they continue to be the standard in this area. To demonstrate how our 
approach can be used for STR detection, we chose to use sequences from the FBI's Combined 
DNA Index System (CODIS): The SE33 locus contained in the GenBank sequence V00481, and 
the VWA locus contained in the GenBank sequence M25858. The periods of STRs found in 
GODIS typically range from 2 to 4 bases, and do not exhibit enough variability to demonstrate 
how our approach would perform under divergent conditions. For this reason, we used the 
V00481 sequence as is, but on M25858 we artificially introduced an STR with period 11, by 
substituting bases 2821-2920 (where we know that there are no other repeating sequences) 
with 9 tandem repeats of ACTTTGCCT AT . We have also introduced base substitutions, 
deletions and insertions on our artificial STR to imitate mutations. 

Let Y^^ = (Yi, ^2, • • • , Ym) denote the DNA sequence in which we are looking for STRs. 
The gist of our approach is simply to choose a periodic probe sequence of length n, say, 
X" = {Xi,X2, . . . , Xn) (typically much shorter than Y-^), and then to calculate the empirical 
mutual information = Ij{n) between X^ and each of its possible alignments with Y-^'^ . 
In order to detect the presence of STRs, the values of the empirical mutual information in 
regions where STRs do appear should be significantly larger than zero, where "significantly" 
means larger than the corresponding estimates in ordinary DNA fragments containing no STRs. 
Obviously, the results will depend heavily on the exact form of the probe sequence. Therefore, 
it is critical to decide on the method for selecting: (a) the length, and (b) the exact contents 
of Xf. The length of X^ is crucial; if it is too short, then itself is likely to appear often 
in y/^, producing many large values of the empirical mutual information and making it hard 
to distinguish between STRs and ordinary sequences. Moreover, in that case there is little 
hope that the analysis of the previous section (which was carried out of long sequences X") 
will provide useful estimates for the probability of error. If, on the other hand, X" is too 
long, then any alignment of the probe X" with Y^^ will likely also contain too many irrelevant 
base pairs. This will produce negligibly small mutual information estimates, again making 
impossible to detect STRs. These considerations are illustrated by the results in Figure [H 

As for the contents of the probe sequence Xf , the best choice would be to take a segment 
X" containing an exact match to an STR present in Y^ . But in most of the interesting 
applications, this is of course unavailable to us. A "second best" choice might be a sequence 
X" that contains a segment of the same "pattern" as the STR present in Y-^'^ , where we say 
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Figure 4: Dependency graph of the GenBank sequence yA^ = y00481, for a probe sequence X" 
which is a repetition of AGGT, of length: (a) 12, or (b) 60. The sequence Y-^'^ contains STRs 
that are repetitions of the pattern AAAG, in the following regions: (i) there is a repetition 
of AAAG between bases 62-108; (ii) AAAG is intervened by AG and AAGG until base 138; 
(iii) again between 138-294 there are repetitions of AAAG, some of which are modified by 
insertions and substitutions. In (a) our probe is too short, and it is almost impossible to 
distinguish the SE33 locus from the rest. However, in (b) the location SE33 is singled out 
by the two big peaks in the mutual information estimates; the shorter peak between the two 
larger ones is due to the interventions described above. Note that the STRs were identified by 
a probe sequence that was a repetition of a pattern different from that of the repeating part 
of the STRs themselves, but of the same period. 
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Figure 5: Dependency graph of the VWA locus contained in GenBank sequence M25858 for 
a probe sequence X" with n = 12, which is a repetition of: (a) TCTA, an exactly matching 
probe, (b) GTGC, a completely different probe, but of the exact same "pattern." In both 
cases, we have chosen to be long enough to suppress unrelated information. Note that 
the results in (a) and (b) are almost identical. The VWA locus contains an STR of TCTA 
between positions 44-123. This STR is apparent in both dependency graphs by forming a 
periodic curve with high correlation. 
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that two sequences have the same pattern if each one can be obtained from the other via a 
permutation of the letters in the alphabet; cf. [HIT^]. For example, TCTA and GTGC have 
the same pattern, whereas TCTA and CTAT do not (although the do have the same empirical 
distribution). For example, if contains the exact same pattern as the periodic part of the 
STR to be detected, and X" has the same pattern as X", then, a priori, either choice should 
be equally effective at detecting the STR under consideration; see Figure [5l [This observation 
also shows that a single probe X" may in fact be appropriate for locating more than a single 
STR; for example, STRs with the same pattern as X", as in Figure O or with the same period, 
as in Figure m] The problem with this choice is, again, that the exact patterns of STRs present 
in a DNA sequence are not available to us in advance, and we cannot expect all STRs in a 
given sequence to be of the same pattern. 

Even though both of the above choices for X" are usually not practically feasible, if the 
sequence Y-^ is relatively short and contains a single STR whose contents are known, then 
either choice would produce high quality data, from which the STR contained in Y^ ^ we can 
easily be detected; see Figure [5] for an illustration. 

In practice, in addition to the fact that the contents of STRs are not known in advance, 
there is also the issue that in a long DNA sequence there are often many different STRs, and a 
unique probe will not match all of them exactly. But since STRs usually have a period between 
2 and 15 bases, we can actually run our method for all possible choices of repetition sequences, 
and detect all STRs in the given query sequence Y^^ . The number of possible probes X" 
can be drastically by observing that (1) We only need one repeating sequence of each possible 
pattern; and (2) It suffices to only consider repetition patters whose period is prime. Note that, 
in view of the earlier discussion and the results shown in Figure [H the period of the repeating 
part of X" is likely to be more important than the actual contents. For example, if we were 
to apply our method for finding STRs in Y-^ with a probe X" whose period is 5 bases long, 
then many STRs with a period that is a multiple of 5 should peak in the dependency chart, 
thus allowing us to detect their approximate positions in Y-^ . Clearly, probes that consist of 

very short repeats, such as AAA , should be avoided. The importance of choosing an X" 

with the correct period is illustrated in Figure El 

The results in Figures HI [5] and [6] clearly indicate that the proposed methodology is very 
effective at detecting the presence of STRs, although at first glance it may appear that it cannot 
provide precise information about their start-end positions and their repeat sequences. But 
this final task can easily be accomplished by re-evaluating Y-^^ near the peak in the dependency 
graph, for example, by feeding the relevant parts separately into one of the standard string 
matching-based tandem repeat algorithms. Thus, our method can serve as an initial filtering 
step which, combined with an exact pattern matching algorithm, provides a very accurate and 
efficient method for the identification of STRs. 

In terms of its practical implementation, note that our approach has a linear running time 
0(M), where M is the length of Y^ . The empirical mutual information of course needs to 
be evaluated for every possible alignment of Y^^ and X", with each such calculation done in 
0{n) steps, where n is the length of X". But n is typically no longer than a few hundred 
bases, and, at least to first order, it can be considered constant. Also, repeating this process 
for all possible repeat periods does not affect the complexity of our method by much, since 
the number of such periods is quite small and can also be considered to be constant. And, as 
mentioned above, choosing probes Xf only containing repeating segments with a prime period, 
further improves the running time of our method. 
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Figure 6: In these charts we use the modified GenBank sequence M25858, which contains the 
VWA locus in CODIS between positions 1683-1762 and the artificial STR introduced by us 
at 2821-2920. The repeat sequence of the VWA locus is TCTA, and the repeat sequence of 
the artificial STR is ACTTTGCCTAT. In (a), the probe has length n = 88 and consists 
of repetitions of AGGT. Here the repeating sequence of the VWA locus (which has period 4) 
is clearly indicated by the peak, whereas the artificial tandem repeat (which has period 11) 
does not show up in the results. The small peak around position 2100 is due to a very noisy 
STR again with a 4 base period. In (b), the probe X" again has length n = 88, and it consists 
of repetitions of GATAGTTCGGA. This produces the opposite result: The artificial STR is 
clearly identified, but there is no indication of the STR present at the VWA locus. 

We, therefore, conclude that: (a) the empirical mutual information appears in this case to 
be a very effective tool for detecting STRs; and (b) selecting the length and repetition period 
of the probe sequence X" is crucial for identifying tandem repeats accurately. 

4 Conclusions 

Biological information is stored in the form of monomer strings composed of conserved biomolec- 
ular sequences. According to Manfred Eigen, "The differentiable characteristic of living systems 
is information. Information assures the controlled reproduction of all constituents, thereby 
ensuring conservation of viability." Hoping to reveal novel, potentially important biological 
phenomena, we employ information-theoretic tools, especially the notion of mutual informa- 
tion, to detect statistically dependent segments of biosequences. The biological implications 
of the existance of such correlations are deep, and they themselves remain unresolved. The 
proposed approach may provide a powerful key to fundamental advances in understanding and 
quantifying biological information. 

This work addresses two specific applications based on the proposed tools. From the 
experimental analysis carried out on regions of the maize zmSRp32 gene, our findings suggest 
the existence of a biological connection between the 5' untranslated region in zmSRp32 and 
its alternatively spliced exons, potentially indicating the presence of novel alternative splicing 
mechanisms or structural scaffolds. Secondly, through extensive analysis of CODIS data, we 
show that our approach is particularly well suited for the problem of discovering short tandem 
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repeats, an application of importance in genetic profiling studies. 
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